[RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change #850
Conversation
Signed-off-by: Archit Kulkarni <[email protected]>
Would appreciate any hints on how to add a unit test for this, or whether this is the right approach!
Maybe we can refer to the existing tests in [1][2] and write tests to ensure …
[1] https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/raycluster_controller_fake_test.go
```diff
@@ -233,6 +233,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
 			r.Log.Error(err, "Update status error", "cluster name", request.Name)
 		}
 	}
+	if instance.Status.State == rayiov1alpha1.Pending {
+		return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil
+	}

 	return ctrl.Result{}, nil
```
We should always requeue. I didn't realize we didn't do that. That's very bad. Reconciliation for any controller should happen periodically even if there isn't a trigger that demands it.
The correct fix for this issue is to change the "successful return" from `return ctrl.Result{}, nil` to `return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil`.
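For illustration, a minimal sketch of the requeue pattern being proposed, assuming a standard controller-runtime reconciler — this is not the actual KubeRay code, and the constant's value here is just the one under discussion at this point in the thread:

```go
package ray

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// DefaultRequeueDuration mirrors the constant referenced above;
// 2 seconds is the value under discussion at this point in the thread.
const DefaultRequeueDuration = 2 * time.Second

type RayClusterReconciler struct{}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the RayCluster and reconcile its pods and services ...

	// Returning ctrl.Result{} makes the controller wait for the next watch
	// event; RequeueAfter additionally schedules an unconditional periodic pass.
	return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil
}
```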
Oh thanks, that's very good to know. Let me update the PR
This is a good point.
```diff
@@ -858,6 +861,8 @@ func (r *RayClusterReconciler) updateStatus(instance *rayiov1alpha1.RayCluster)
 	} else {
 		if utils.CheckAllPodsRunnning(runtimePods) {
 			instance.Status.State = rayiov1alpha1.Ready
+		} else {
+			instance.Status.State = rayiov1alpha1.Pending
```
It's normal for a Ray cluster to gain pods that are in Pending state, particularly if autoscaling is enabled. With this change, the cluster could switch back and forth between Running and Pending. That seems undesirable. (The current reporting of state is not great either.)
@DmitriGekhtman
How do we differentiate between Pods that are Pending due to the autoscaler vs. Pods that are Pending during initial cluster setup?
By the following funky logic: the only cluster `State` transition in the current logic is `""` -> `Ready`. If, after the initial cluster setup, we have a `Pending` pod due to a scale change or failure, the `State` will not return to `""`.
I'd leave the state reporting alone for now (maybe track the issue of making it more sensible).
We should change things so that we always requeue, i.e. do `return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, nil`.
I am fuzzy on how these filters work, but the way things should work is that … Currently, it might be that neither event triggers a reconcile.
This reverts commit 368dd36.
Signed-off-by: Archit Kulkarni <[email protected]>
@DmitriGekhtman Updated to always requeue the RayCluster reconciler. After this, it seems the RayJob reconciler also gets called every few seconds even after the job has succeeded. I'm not exactly sure what is causing the RayJob reconciler to get requeued, but it sounds like this is what we want anyway. There are ~10 other instances of …
Probably the RayCluster status update is causing that.
We could say testing for the fix falls under the umbrella of e2e testing for Ray Jobs, which is tracked in #763.
I guess by default it triggers once every 10 hrs. I'm not sure if once every 2 seconds is too frequent to requeue by default, but I think this should work fine for now. In a follow-up, we should fix the event filters to ensure that pod changes trigger immediate reconciliation.
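(For context, the ~10 hr figure matches controller-runtime's default informer resync period. A minimal sketch of where that interval can be tuned when constructing the manager — the value shown just makes the default explicit:)

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// controller-runtime resyncs its informer cache roughly every 10 hours
	// (with jitter) by default; SyncPeriod overrides that interval.
	syncPeriod := 10 * time.Hour
	if _, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	}); err != nil {
		panic(err)
	}
}
```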
One thing I'd like to confirm: in this case, is it meaningful to submit the job if the head is running and some workers are pending? I think moving forward with job submission makes sense. We also need to be careful to honor gang-scheduling settings as well.
The latest change will actually reconcile an object even after it reaches its end state, which is meaningless. I would say it brings side effects, especially when there are lots of CRs in the cluster. I think there are better ways to solve the problem (make the changes specific to jobs instead of clusters globally)?
I think we messed up the event filters in a previous PR -- we should fix the filters so that changes to RayCluster status are ignored but changes to the status of other watched resources (such as pods) are not ignored. About periodically re-triggering reconciliation -- …
So, again, given the above discussion -- my current opinion is that we should try to solve the issue by fixing the event filter so that pod status changes trigger reconcile. Separately, we can consider whether we want to always requeue after 30 seconds or a minute.
Hi @DmitriGekhtman, I am not sure whether the Pod status change is enough or not. For example, if a ClusterIP service for a head Pod is deleted accidentally and the head Pod's status does not change, the service may not be created again.
I guess with the 2 or 30-60 second automatic reconcile, the service would be created again -- does this resolve the concern? So we would have both this automatic reconciling as well as the reconciling upon Pod status change.
Right, I'd recommend both fixing the pod status trigger and requeueing a minute after a successful reconcile.
Sorry, "Requesting changes" again to indicate my intent -- @Jeffwan had a good point that unconditionally triggering reconcile every two seconds is overkill.
Instead we should:
- Unconditionally trigger reconcile once a minute. (Modify the 2 seconds in the currently altered line to 60 seconds.)
- Make sure pod status changes trigger reconcile, by fixing the event filter (a sketch of one possible fix follows below).
Once this is fixed, we could possibly consider a patch 0.4.1 release to fix Ray Jobs.
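For illustration, a hedged sketch of the kind of event-filter fix meant here. This is not the actual KubeRay code: the Ray API import path and the `SetupWithManager` wiring are assumptions, and `GenerationChangedPredicate` (a stock controller-runtime helper) drops updates that don't bump `.metadata.generation` -- which covers status-only updates, though it would also drop label/annotation-only changes -- while the pod watch stays unfiltered:

```go
package ray

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	// Assumed import path for the RayCluster API types.
	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// Stand-in for the real reconciler, to keep the sketch self-contained.
type RayClusterReconciler struct{}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil // placeholder
}

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Filter only the RayCluster watch: status-only updates (e.g.
		// LastUpdateTime bumps) no longer trigger reconcile.
		For(&rayiov1alpha1.RayCluster{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		// Pod and Service watches stay unfiltered, so a pod transition
		// such as Pending -> Running still triggers reconcile.
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		Complete(r)
}
```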
Signed-off-by: Archit Kulkarni <[email protected]>
Manually tested, so @DmitriGekhtman I think this is ready to be merged. I think it makes sense to fix the filter issue #872 in a second PR instead of this PR, because they're conceptually separate changes, but I don't feel strongly either way.
Makes sense to leave the filter fix for later. For posterity, could you update the PR title and description to explain the changes?
@DmitriGekhtman Ah, thanks for the reminder -- should be good to go now.
For customers that are using KubeRay and managing hundreds or thousands of clusters, a 60s reconciliation loop would have a big impact on the operator itself and on the traffic to the k8s API server. I think this should be a configurable attribute; we can default it to 5 minutes. For example, we can pass an env var to the operator that specifies a different frequency instead of hardcoding it.
@akanso That makes sense to me, so something like this? (cc @DmitriGekhtman)

```go
// Requires the "os", "strconv", and "time" imports.
var requeueAfterSeconds int
requeueAfterSeconds, err := strconv.Atoi(os.Getenv("RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S"))
if err != nil {
	requeueAfterSeconds = 5 * 60 // default to 5 minutes
}
return ctrl.Result{RequeueAfter: time.Duration(requeueAfterSeconds) * time.Second}, nil
```

And how would an environment variable typically be passed to the operator?
We can simply pass it in the k8s YAML manifest of the operator pod. If it is missing, we default to 5 minutes.
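(For illustration, a minimal sketch of what that could look like in the operator's Deployment manifest; the container name and image are assumptions, and the env var name mirrors the snippet above:)

```yaml
# Hypothetical excerpt of the KubeRay operator Deployment.
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator   # assumed container name
          image: kuberay/operator  # assumed image
          env:
            - name: RAYCLUSTER_DEFAULT_RECONCILE_LOOP_S
              value: "300"         # reconcile every 5 minutes
```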
5 minute default sounds good |
Doesn't reconcile trigger on any pod spec change, though? |
Signed-off-by: Archit Kulkarni <[email protected]>
Successfully tested manually with the latest change.
It does. But imagine KubeRay is down or restarting: then it can miss an event, such as the creation/modification/deletion of a RayCluster. This reconcile logic will make sure missed events are processed. The events already processed will be ignored, since the desired state is already satisfied.
I meant that choosing 5 min over 1 min for performance reasons in a large cluster might not make sense, given that a large cluster has many pod changes happening -- these pod changes trigger many reconciles anyway.
@DmitriGekhtman "Go-build-and-test / Compatibility Test - Nightly (pull_request)" is failing, but I think this is known to be flaky. Maybe we can bypass the test and merge this PR anyway? |
Failing test is known to be flakey. |
…ery 60s instead of only upon change (ray-project#850)
Have the RayCluster controller reconcile once per 5 minutes.
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Why are these changes needed?
This PR fixes an issue where RayJobs would never be submitted even if the Ray Cluster was up and running. Debugged this with @brucez-anyscale .
The root cause is the following:
The cluster status is only marked "Ready" if all pods are RUNNING. The cluster state only gets updated when a reconcile happens.
However, the Cluster reconcile doesn't get requeued if all the pods are in [RUNNING, PENDING] state.
In particular, when the head pod becomes PENDING, the cluster never gets reconciled again.
So the cluster is stuck in a state where its status is not "Ready". Since the RayJob reconciler waits for the cluster status to be "Ready" before submission, the job hangs and is never submitted.
The fix in this PR is to:

- ~~requeue the cluster reconcile in the case where the cluster pods are not all running yet; this introduced a new cluster state "Pending" for this case~~ (superseded by the revert above)
- always requeue the cluster reconciliation (every 300s, for now).

As for why the head node state change PENDING -> RUNNING didn't trigger a cluster reconcile before, we think it might be due to an event filter (kuberay/ray-operator/controllers/ray/raycluster_controller.go, line 803 at 633ff63) which only does the reconcile when there's a config, label, or annotation change. However, we can't remove this filter, because it's necessary to prevent reconciling in a tight loop: `instance.Status.LastUpdateTime` is updated at each reconcile.

Conclusion after discussion on this PR: fixing the event filters is a separate but related bug and will be addressed in a follow-up PR; the issue will be tracked here: #872. Once that is fixed, we could possibly consider a patch 0.4.1 release. After this PR, there may be an up-to-60s unnecessary delay between when the cluster is ready and when the Job is submitted, which is not ideal.
Related issue number
Closes #845
Checks